Improving Semistatic Compression Via Pair-Based Coding

Authors

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • José R. Paramá
Abstract

In recent years, new semistatic word-based byte-oriented compressors, such as Plain and Tagged Huffman and the Dense Codes, have been used to improve the efficiency of text retrieval systems while reducing the compressed collections to 30–35% of their original size. In this paper, we present a new semistatic compressor, called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27–28% of their original size, outperforming the optimal 0-order prefix-free semistatic compressor (Plain Huffman) by more than 3 percentage points. Moreover, PETDC also permits random decompression and direct searches using fast Boyer-Moore algorithms. PETDC builds a vocabulary with both words and pairs of words. The basic idea behind PETDC is that, since each symbol in the vocabulary is given a codeword, compression is improved by replacing two words of the source text with a single codeword.
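The idea of giving one codeword to a pair of words can be illustrated with a minimal Python sketch. It assumes the usual End-Tagged Dense Code layout (the last byte of each codeword has its high bit set, and shorter codewords go to more frequent symbols). The pair selection shown here, keeping the `max_pairs` most frequent adjacent pairs, is a hypothetical simplification and not the selection heuristic of the paper; all function names are illustrative.

```python
from collections import Counter

def etdc_encode(rank):
    """Codeword for a 0-based frequency rank; only the last byte is >= 128."""
    out = [128 + rank % 128]          # tagged final byte
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)        # untagged leading bytes (< 128)
        rank //= 128
    return bytes(reversed(out))

def etdc_decode(code):
    """Inverse of etdc_encode for a single codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)

def parse_with_pairs(words, pairs):
    """Greedy left-to-right parse: emit a pair symbol when two adjacent
    words form one of the chosen pairs, otherwise emit the single word."""
    symbols, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            symbols.append((words[i], words[i + 1]))
            i += 2
        else:
            symbols.append(words[i])
            i += 1
    return symbols

def compress(words, max_pairs=1000):
    # ASSUMPTION: keep the most frequent adjacent pairs that repeat.
    # This is a stand-in for the paper's actual pair-selection method.
    pair_counts = Counter(zip(words, words[1:]))
    pairs = {p for p, c in pair_counts.most_common(max_pairs) if c > 1}
    symbols = parse_with_pairs(words, pairs)
    # Semistatic step: rank symbols by frequency, then assign codewords.
    freq = Counter(symbols)
    ranked = sorted(freq, key=lambda s: (-freq[s], str(s)))
    codebook = {s: etdc_encode(r) for r, s in enumerate(ranked)}
    return b"".join(codebook[s] for s in symbols), codebook

data, codebook = compress("to be or not to be".split(), max_pairs=1)
print(len(data), "bytes for 6 words")
```

In this toy input the pair ("to", "be") becomes a single vocabulary symbol, so the six words are coded with four one-byte codewords. Because a byte with its high bit set always marks the end of a codeword, codeword boundaries can be found from any position in the compressed stream, which is the property behind random decompression and direct search.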


Similar resources

Improving semistatic compression via phrase-based modeling

In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a...


ISSDC: Digram Coding Based Lossless Data Compression Algorithm

In this paper, a new lossless data compression method based on digram coding is introduced. This method uses semi-static dictionaries: all of the characters used and the most frequently used two-character blocks (digrams) in the source are found and inserted into a dictionary in the first pass, and compression is performed in the second pass. This two-pass structure is repeated...


Housekeeping for prefix coding

We consider the problem of constructing and transmitting the prelude for Huffman coding. With careful organization of the required operations and an appropriate representation for the prelude, it is possible to make semistatic coding efficient even when the size of the source alphabet is of the same magnitude as the length of the message being coded. The proposed structures are of direct r...


New adaptive compressors for natural language text

Semistatic byte-oriented word-based compression codes have been shown to be an attractive alternative to compress natural language text databases, because of the combination of speed, effectiveness, and direct searchability they offer. In particular, our recently proposed family of dense compression codes has been shown to be superior to the more traditional byte-oriented word-based Huffman cod...


Effective Variable-Length-to-Fixed-Length Coding via a Re-Pair Algorithm

We address the problem of improving variable-length-to-fixed-length codes (VF codes). A VF code is an encoding scheme that uses a fixed-length code, and thus, one can easily access the compressed data. However, conventional VF codes usually have an inferior compression ratio to that of variable-length codes. Although a method proposed by T. Uemura et al. in 2010 achieves a good compression ratio com...



Publication date: 2006